Summary. We study the posterior distribution of the Bayesian multiple change-point regression
problem when the number and the locations of the change-points are unknown. While it is
relatively easy to apply the general theory to obtain the $O(1/\sqrt{n})$ rate up to some logarithmic
factor, showing the exact parametric rate of convergence of the posterior distribution requires
additional work and assumptions. Additionally, we demonstrate the asymptotic normality of the
segment levels under these assumptions. For inferences on the number of change-points, we
show that the Bayesian approach can produce a consistent posterior estimate. Finally, we argue
that the point-wise posterior convergence property as demonstrated might have bad finite
sample performance, in that a consistent posterior for model selection necessarily implies the
maximal squared risk will be asymptotically larger than the optimal $O(1/n)$ rate. This is the
Bayesian version of the same phenomenon that has been noted and studied by other authors.
1. Introduction

We consider the regression problem of estimating a piece-wise constant function when the
number of segments as well as the locations of its change-points are unknown. This is
an old problem that has attracted much attention recently (Goldenshluger et al., 2006;
Ben Hariz et al., 2007; Fearnhead, 2008). Applications of multiple change-point models
surged after efficient computation using reversible jump MCMC was discovered (Green, 1995).
Green (1995) applied piece-wise constant functions in the study of the coal mining disaster
data in the context of Poisson processes. A more recent trend of analysis that dispenses
with the usage of MCMC for the change-point problem starts with the paper
Liu and Lawrence (1999), where a dynamic programming approach is utilized to marginalize
over segment levels and change-point locations. Their original motivation comes from
the problem of partitioning DNA sequences into homogeneous segments. This dynamic
programming approach is later extended by Fearnhead (2006) and Lian (2008).



Unlike the above studies, in this paper we are only concerned with the asymptotic properties
of Bayesian multiple change-point problems and investigate, from the frequentist view,
the posterior contraction characteristics of a simplified model. Although a piece-wise constant
function involves only a finite number of parameters, as we will only consider the case
where an upper bound on the number of change-points is available a priori, the problem is
nevertheless best studied from an infinite-dimensional viewpoint, with the estimation problem
put in the context of function spaces. Until recently, little was known about the behavior of
the posterior distribution of infinite-dimensional models. For consistency issues, Schwartz (1965)
shows that the posterior is consistent when certain tests can be established for the true
distribution versus the complement of its neighborhood. Barron et al. (1999) further developed
the theory by sieve construction and metric entropy bounds. Convergence rates
are studied in two independent and to some extent overlapping but complementary works
(Ghosal et al. (2000) and Shen and Wasserman (2001)). In particular, Ghosal et al. (2000)
extends the idea of constructing suitable tests in order to bound the convergence rates
for both nonparametric and parametric problems, and Ghosal and Van der Vaart (2007)
further extends the approach to non-i.i.d. observations. The existence of tests for many
specific problems can be found in the existing literature although sometimes new tests need
to be carefully designed.

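In nonparametric Bayesian analysis, we have an i.i.d. sample $Z_1, \ldots, Z_n$ from the
distribution $P_0$ with density $p_0$ with respect to some measure on the sample space
$(\mathcal{X}, \mathcal{B})$. The model space is denoted by $\mathcal{P}$, which is known to contain
the true distribution $P_0$. Given some prior distribution $\Pi$ on $\mathcal{P}$, the posterior is a
random measure given by

$$\Pi(A \mid Z_1, \ldots, Z_n) = \frac{\int_A \prod_{i=1}^n p(Z_i)\, d\Pi(P)}{\int_{\mathcal{P}} \prod_{i=1}^n p(Z_i)\, d\Pi(P)}.$$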

For ease of notation, we will omit the explicit conditioning and write $\Pi^n(A)$ for the posterior
distribution. We say that the posterior is consistent if

$$\Pi^n(P \in \mathcal{P} : d(P, P_0) > \epsilon) \to 0 \quad \text{in } P_0^n \text{ probability}$$

for any $\epsilon > 0$, where $d$ is some suitable distance function between probability measures.

To study rates of convergence, let $\epsilon_n$ be a sequence decreasing to zero. We say the rate
is at least $\epsilon_n$ if for sufficiently large constant $M$

$$\Pi^n(P : d(P, P_0) > M\epsilon_n) \to 0 \quad \text{in } P_0^n \text{ probability}.$$

We also need a slightly weaker definition of rates of convergence by replacing $M$ with a
sequence $M_n$ and requiring that the above posterior mass converge to zero for any sequence
$M_n$ that diverges to infinity. This definition is usually required in parametric problems to
get rid of the extra $\log n$ factor in the convergence rates.

In our regression problem, we observe an i.i.d. sample $Z = (Z_1, \ldots, Z_n)$ with the
distribution of $Z_i = (X_i, Y_i)$ defined structurally by

$$Y_i = \theta_0(X_i) + \epsilon_i$$

for i.i.d. Gaussian noise $\epsilon_i \sim N(0, 1)$, where $\theta_0$ is a piece-wise constant function on
$[0, 1)$ with unknown locations of change-points. We can write
$\theta_0(t) = \sum_{j=1}^{k_0} a_j^0 I(t_{j-1}^0 \le t < t_j^0)$, $t_0^0 = 0 < t_1^0 < t_2^0 < \cdots < t_{k_0}^0 = 1$,
using the indicator function, and thus $\theta_0$ is parameterized by
$(a^0, t^0)$, $a^0 = (a_1^0, \ldots, a_{k_0}^0)$, $t^0 = (t_0^0, \ldots, t_{k_0}^0)$. For simplicity, we assume the
marginal distributions for $\{X_i\}$ are i.i.d. uniform on $[0, 1)$, and note that it is straightforward
to extend all the following results to any distribution of $X$ with density bounded away from
zero and infinity. Note $P_0$ is fully determined by $\theta_0$ under these assumptions, and thus we
also use the space of piece-wise constant functions as our model space, which is equivalent to
using $\mathcal{P}$. The measure induced by $\theta$ is denoted by $P_\theta$, and thus $P_{\theta_0}$ is the same as
$P_0$, the true distribution.
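
As a concrete illustration of the sampling model just described, the following Python sketch draws pairs $(X_i, Y_i)$ with a uniform design and unit-variance Gaussian noise around a step function. The function names `step_function` and `simulate`, and the particular levels and change-points in the usage lines, are illustrative assumptions and do not come from the paper.

```python
import bisect
import random


def step_function(t, levels, change_points):
    """Evaluate theta(t) = sum_j a_j * I(t_{j-1} <= t < t_j) for t in [0, 1).

    change_points is (t_0, ..., t_k) with t_0 = 0 and t_k = 1;
    levels is (a_1, ..., a_k).
    """
    interior = change_points[1:-1]
    # bisect_right makes each segment closed on the left and open on the right
    return levels[bisect.bisect_right(interior, t)]


def simulate(n, levels, change_points, rng):
    """Draw X_i ~ Uniform[0, 1) and Y_i = theta(X_i) + N(0, 1) noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [step_function(x, levels, change_points) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


# Hypothetical truth with k_0 = 3 segments
rng = random.Random(0)
xs, ys = simulate(500, [1.0, -2.0, 0.5], [0.0, 0.3, 0.7, 1.0], rng)
```

The half-open convention in `step_function` mirrors the indicator $I(t_{j-1} \le t < t_j)$ used in the text.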

Lian studied the consistency issue for the above model with the exception that
the $X_i$'s there are deterministically chosen on a grid. In that paper, consistency is proved for the
case that the true regression function is in the Lipschitz class as well. Another related work
is Scricciolo (2007), where the Bayesian density estimation problem is studied with the density
approximated by piece-wise constant functions. Besides the fact that they are interested in
density estimation instead of regression, the focus of that paper is very different from the
current one. They are mostly concerned with the case of approximating a smooth density
using step functions and aim to achieve the optimal rates up to a logarithmic factor. For
density functions that are piece-wise constant, they prove the parametric rate of convergence
also with an extra logarithmic factor. A diverging number of grid points is used and thus
this approach cannot be used to estimate the number of segments when the density is truly
piece-wise constant.

One simplification of our model compared with those works mentioned at the beginning
of this section that focus on the computational issues is that the variance of the noise is
assumed to be known here (and actually 1 without loss of generality). Investigations of
Bayesian regression with unknown noise levels can run into additional technical difficulties,
especially in the design of appropriate tests. Consistency of a regression problem with
unknown noise level is addressed in Choi and Schervish (2007). We hope to be able to
address this problem in our context in a future paper.

In this paper, we focus on the case where $\theta_0$ is piece-wise constant and aim to achieve the
exact parametric $O(1/\sqrt{n})$ rate of convergence and also study the posterior consistency
in the estimation of the number of change-points, which we refer to as the model selection
problem. The proofs for the estimation rates involve direct application of general theorems
in Ghosal et al. (2000), but the calculation of the covering number is nontrivial in this case.
In order to achieve the exact parametric rate, an additional assumption needs to be made
to exclude functions with segment lengths that are too short.



2. Main results 

Consider the case where we have some a priori bounds for the number of change-points as
well as for the segment levels $\{a_j\}$. The model space is defined as

$$\Theta = \Big\{\theta : \theta(t) = \sum_{j=1}^{k} a_j I(t_{j-1} \le t < t_j),\ 0 = t_0 < t_1 < \cdots < t_k = 1,\ k \le k_{max},\ |a_j| \le K\Big\}.$$

By convention, we say $\theta$ with $t_k = 1$ has $k$ change-points, which is the same as the number
of segments.

Another equivalent representation of $\Theta$ is $\Theta = \{(a, t) \in [-K, K]^k \times T_k : 1 \le k \le k_{max}\}$,
where $T_k$ is the set of $(k+1)$-tuples $(t_0, \ldots, t_k)$ with $t_j < t_{j+1}$. We will not distinguish
between these two different representations and $\theta$ can denote either a function or the tuple
$(a, t)$. This ambiguity can always be resolved by the context.

For rates of convergence, the distance $d$ we use is the $L_2$ norm of the function,
$\|\theta\| = \left[\int_0^1 \theta^2(x)\,dx\right]^{1/2}$. Since we only consider uniformly bounded functions,
the $L_2$ norm is equivalent to the Hellinger distance (e.g. Ghosal and Van der Vaart (2007), section 7.7).



We now specify a prior on $\Theta$ using a hierarchical approach. Let $\Theta_k$ be the subspace of
$\Theta$ that consists of functions with $k$ change-points, and the prior $\Pi$ is specified as a mixture

$$\Pi = \sum_{k=1}^{k_{max}} p(k)\,\Pi_k, \quad p(k) > 0, \quad \sum_k p(k) = 1,$$

with $\Pi_k$ the prior measure on $\Theta_k$. We assume that $\Pi_k$ has a density $\pi_k(\theta)$ which can be
further decomposed as

$$\pi_k(\theta) = \pi_k^a(a \mid t)\,\pi_k^t(t).$$

The assumption we make on the prior is that

(A) The densities $\pi_k^a(a \mid t)$ and $\pi_k^t(t)$ are bounded away from zero and infinity on
$[-K, K]^k$ and $T_k$ respectively.

This assumption is satisfied, for example, when $t_1, t_2, \ldots, t_{k-1}$ are distributed as the order
statistics of $k-1$ points uniformly distributed on $[0, 1)$ while segment levels are independent
and uniformly distributed on $[-K, K]$.
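
The example prior above can be sampled directly. The sketch below is a minimal illustration, assuming the mixture weights $p(k)$ are supplied as a list; the names `sample_prior_k` and `sample_prior` and the numerical values in the usage lines are hypothetical.

```python
import random


def sample_prior_k(k, K, rng):
    """Draw (levels, change_points) from Pi_k: uniform order statistics for the
    k - 1 interior change-points and independent Uniform[-K, K] segment levels."""
    interior = sorted(rng.random() for _ in range(k - 1))
    change_points = [0.0] + interior + [1.0]
    levels = [rng.uniform(-K, K) for _ in range(k)]
    return levels, change_points


def sample_prior(p, K, rng):
    """Draw from the mixture sum_k p(k) Pi_k, where p[k - 1] holds p(k)."""
    k = rng.choices(range(1, len(p) + 1), weights=p)[0]
    return sample_prior_k(k, K, rng)


# Hypothetical choice: k_max = 3 with weights (0.2, 0.3, 0.5) and K = 2
rng = random.Random(1)
levels, change_points = sample_prior([0.2, 0.3, 0.5], 2.0, rng)
```

Both densities in this example are constant on their supports, which is why assumption (A) holds for it.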

The first simple result shows that the posterior rate of convergence is $n^{-1/2}$ up to a
logarithmic factor.

Theorem 1. Under assumption (A), the posterior rate of convergence is at least
$\epsilon_n = \log^{1/2} n / n^{1/2}$, i.e. $\Pi^n(\theta : \|\theta - \theta_0\| > M\epsilon_n) \to 0$ in $P_0^n$ probability for
sufficiently large $M$.

Theorem 1 considers the convergence rates of the estimation problem. A different problem
is the convergence of the posterior for the number of change-points. Under no additional
assumptions, we can show that the posterior probability will concentrate on the true number
of change-points with probability converging to 1.

Theorem 2. Under the same assumption (A), we have $\Pi^n(k = k_0) \to 1$ in $P_0^n$ probability,
where $k_0$ is the number of change-points for the true function $\theta_0$.



Nonparametric Bayesian model selection has been investigated in Ghosal et al.
The focus of that paper is on conditions under which the adaptive rates are achieved when
simultaneously considering models with different rates of contraction. Thus it seems the
results presented there cannot be directly applied in our case.

To get rid of the extra logarithmic factor in Theorem 1, one would use the more refined
Theorem 2.4 in Ghosal et al. (2000), using the local covering number instead of the global one.
Nevertheless, as shown by Lemma 4 in the appendix, the local covering number for $\Theta$ is not
bounded as would be required if we set $\epsilon_n = O(1/\sqrt{n})$. Instead, we consider the smaller
model space

$$\Theta^\delta = \{\theta \in \Theta : \min_j |t_j - t_{j-1}| \ge \delta\}.$$

We can define $\Theta_k^\delta$ in a similar way and assumption (A) can be modified accordingly.
Specification of a prior on $\Theta^\delta$ is easy. Conceptually, we can just restrict $\pi_k^t(t)$ to be supported
on the locations satisfying this constraint and renormalize the density. Reversible jump
algorithms can be easily modified to take into account the constraint. Dynamic programming
can also incorporate the pre-specified shortest possible segment length (Lian (2007b)).
Theorem 1 and Theorem 2 are still true on $\Theta^\delta$ with few modifications to the proofs.



Green (1995) also noticed the practical advantage of avoiding short steps, and proposed
using even-numbered order statistics from $2k - 1$ uniformly distributed points so that short
segment lengths are better penalized.
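
The two ways of discouraging short segments mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any cited paper: `even_order_change_points` takes the even-numbered order statistics of $2k-1$ uniform points, and `sample_with_min_length` uses plain rejection sampling as one hypothetical way to restrict uniform order statistics to segment lengths of at least $\delta$. All function names are invented for this example.

```python
import random


def even_order_change_points(k, rng):
    """Interior change-points = 2nd, 4th, ..., (2k-2)th order statistics
    of 2k - 1 uniform points on [0, 1)."""
    u = sorted(rng.random() for _ in range(2 * k - 1))
    return [0.0] + u[1::2] + [1.0]


def min_segment_length(change_points):
    return min(b - a for a, b in zip(change_points, change_points[1:]))


def sample_with_min_length(k, delta, rng, max_tries=10000):
    """Rejection sampling: uniform order statistics conditioned on every
    segment having length at least delta (requires k * delta < 1)."""
    for _ in range(max_tries):
        cps = [0.0] + sorted(rng.random() for _ in range(k - 1)) + [1.0]
        if min_segment_length(cps) >= delta:
            return cps
    raise RuntimeError("no draw satisfied the constraint; delta may be too large")
```

Rejection sampling draws exactly from the restricted and renormalized density, at the cost of wasted draws when $k\delta$ is close to 1.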

As shown in the appendix, putting some lower bound on the segment lengths makes 
the local covering number bounded by a constant. This requires a very detailed argument 
to construct the covering. Using this more refined bound on the covering number, we can 
achieve the exact parametric rate. 



Bayesian change-point problem 5 



Theorem 3. For any $\delta > 0$, under assumption (A), the posterior rate of convergence
on $\Theta^\delta$ is at least $\epsilon_n = O(1/\sqrt{n})$. That is, for every $M_n \to \infty$, we have that
$\Pi^n(\theta \in \Theta^\delta : \|\theta - \theta_0\| > M_n\epsilon_n) \to 0$ in $P_0^n$ probability.

Combination of Theorem 2 and Theorem 3 immediately gives us the rates of convergence for the
change-point locations:

Corollary 1. Under the same assumptions as above, the posterior convergence rate
for the change-point locations is at least $\epsilon_n^2 = O(1/n)$, that is, for any sequence $M_n \to \infty$,
$\Pi^n(\max_{1 \le j \le k_0} |t_j - t_j^0| > M_n\epsilon_n^2) \to 0$ in $P_0^n$ probability. This rate of course agrees with
many frequentist approaches, say using the cumulative sum.
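
For readers unfamiliar with the cumulative-sum approach mentioned in the corollary, the sketch below shows one common textbook form of it for a single change in mean. It is not taken from the paper: the standardization used and the function name `cusum_change_point` are assumptions of this example, and the observations are assumed to be ordered by their design points.

```python
import math


def cusum_change_point(ys):
    """Return the split size m in 1..n-1 that maximizes the standardized CUSUM
    statistic |S_m - (m/n) S_n| / sqrt(m (n - m) / n), where S_m is the sum of
    the first m observations. ys must be ordered by the design points."""
    n = len(ys)
    total = sum(ys)
    best_m, best_stat, partial = 1, -1.0, 0.0
    for m in range(1, n):
        partial += ys[m - 1]
        stat = abs(partial - m * total / n) / math.sqrt(m * (n - m) / n)
        if stat > best_stat:
            best_m, best_stat = m, stat
    return best_m
```

The returned `m` is the number of observations placed before the estimated change, so the estimated location lies between the `m`-th and `(m+1)`-th ordered design points.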

It is well-known that the posterior distribution in regular parametric models converges
to a Gaussian distribution under weak conditions. Since the previous results
show that the number and locations of the change-points can be consistently estimated, one
would naturally conjecture that the posterior distribution for segment levels will converge
to a multivariate Gaussian distribution. This is indeed the case as stated in the following
theorem:

Theorem 4. Suppose the true segment lengths are $l_j = t_j^0 - t_{j-1}^0$, $j = 1, \ldots, k_0$. Denoting
the posterior distribution of $a = (a_1, \ldots, a_k)$ restricted on the event $k = k_0$ (which has
a posterior probability converging to 1) by $\Pi^n_{a|Z}$ and the covariance matrix $I_0 = \mathrm{diag}(l_j \cdot n)$,
then we have

$$E_{Z|\theta_0} \|\Pi^n_{a|Z} - N(\hat a(t^0), I_0)\|_{TV} \to 0,$$

where $\|P - Q\|_{TV}$ is the total variation distance between probability measures $P$ and $Q$, and
$\hat a(t^0)$ is the maximum likelihood estimator for $a$, assuming the true locations of the change-points
are known.
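
To make the centering in Theorem 4 concrete: with unit-variance Gaussian noise and known change-points, the maximum likelihood estimator of each segment level is the average of the responses whose design points fall in that segment. The sketch below computes these averages together with the per-segment counts $n_j$, which are close to $n \cdot l_j$ under the uniform design. The function name `segment_mle` is hypothetical, and the code assumes every segment contains at least one observation.

```python
import bisect


def segment_mle(xs, ys, change_points):
    """Return (means, counts): the per-segment sample means of ys (the MLE of
    the levels when change_points are known) and the number of observations
    falling in each segment [t_{j-1}, t_j)."""
    k = len(change_points) - 1
    interior = change_points[1:-1]
    counts = [0] * k
    sums = [0.0] * k
    for x, y in zip(xs, ys):
        j = bisect.bisect_right(interior, x)
        counts[j] += 1
        sums[j] += y
    # assumes every segment contains at least one observation
    means = [s / c for s, c in zip(sums, counts)]
    return means, counts


means, counts = segment_mle([0.1, 0.2, 0.6, 0.7, 0.9], [1.0, 3.0, 2.0, 4.0, 6.0], [0.0, 0.5, 1.0])
```

As a side remark resting on the standard Gaussian conjugate calculation rather than on the paper: under a flat prior on a level and known change-points, the conditional posterior of that level is Gaussian around the segment mean with variance $1/n_j$, so its spread is of order $1/\sqrt{n\, l_j}$.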

The above theorems show that the Bayesian procedure possesses very good properties.
On the one hand, the exact parametric rate is achieved for the estimation problem in
the function space. On the other hand, the number of change-points can be consistently
estimated. This is reminiscent of the recent literature on the oracle property of penalized
estimators. As shown in Fan and Li (2001), the SCAD estimator, when the smoothing
parameter is chosen appropriately, can estimate the zero coefficients in a linear regression
model as exactly zero with probability converging to one as sample size increases. At
the same time, the estimator is still consistent for nonzero coefficients and the asymptotic
distribution is the same whether or not the correct zero positions are known. This is
called the oracle property by Fan and Li (2001). More recently, Leeb and Pötscher (2008)
showed that the oracle property might be misleading in terms of the estimator's finite
sample performance and it is impossible to adapt to the unknown zero restrictions without
paying a price. The caveat lies in the point-wise nature of the asymptotic theory laid out
in Fan and Li (2001). The authors of Leeb and Pötscher (2008) show that an unbounded
(normalized) risk results for any estimator possessing the sparsity property. Another related
work is Yang (2005), where the author shows that AIC has a minimax property which cannot
be shared with any model selection consistent estimators in a regression problem.

In our context, a similar conclusion can be drawn for the Bayesian multiple change-point
problem. Theorem 2 and Theorem 3 apply to a fixed true piece-wise constant
function and thus the convergence as stated is point-wise in nature. It is not difficult to
see from the proof of Theorem 3 that the $1/\sqrt{n}$ rate is not actually uniform over the class
$\Theta^\delta$. The reason is that to obtain the bound for the local covering number (Lemma 5 in the
appendix), the constants involved do depend on $\theta_0$. In particular, the derivation of the
lemma requires a lower bound on the size of the jumps of the neighboring segments and
thus the convergence is not uniform over $\Theta^\delta$. Intuitively, small jumps make the estimation
more difficult and heavier penalization by the prior must be entertained (possibly by using
a prior that depends on the sample size) to achieve model selection consistency at the cost
of losing estimation accuracy. As seen in the proof of Theorem 5, the difficulty occurs when
the size of the jump is of order $O(1/\sqrt{n})$, in which case it becomes difficult to detect the
change-point.

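Nevertheless, as discussed above, the convergence is uniform if we further restrict our
attention to the sub-class:

$$\Theta^{\delta_1, \delta_2} = \{\theta \in \Theta : \min_j |t_j - t_{j-1}| \ge \delta_1,\ \min_j |a_j - a_{j-1}| \ge \delta_2\}.$$

We state the uniform convergence as a proposition without proof:

Proposition 1. For any fixed $\delta_1, \delta_2 > 0$, the rate of convergence is uniformly at least
$\epsilon_n = O(1/\sqrt{n})$. That is, for any $M_n \to \infty$,
$\sup_{\theta^* \in \Theta^{\delta_1, \delta_2}} E_{Z|\theta^*} \Pi^n(\theta \in \Theta^{\delta_1, \delta_2} : \|\theta - \theta^*\| > M_n\epsilon_n) \to 0$.
The property of model selection consistency is still satisfied in this case.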

On the other hand, the following result confirms that we cannot expect the posterior
to converge uniformly over the class $\Theta^\delta$ if the method can adapt to the number of
change-points. Note that the theorem applies for any Bayesian posterior distribution for the
change-point problem, not just the specific prior we constructed.

Theorem 5. Suppose the posterior distribution satisfies the model consistency condition:
$\Pi^n(k = k_0) \to 1$ in $P_0^n$ probability. Then the maximal $L_2$ convergence of $\theta$ is necessarily
slower than the parametric rate $\epsilon_n = O(1/\sqrt{n})$. That is, for some $M_n \to \infty$,

$$\sup_{\theta^* \in \Theta^\delta} E_{Z|\theta^*} \Pi^n(\theta \in \Theta^\delta : \|\theta - \theta^*\| > M_n\epsilon_n) \to 1.$$

The above theorem demonstrates the trade-off between function estimation and model
selection for our Bayesian multiple change-point problems.



3. Discussion 



In this paper, we investigated in detail some asymptotic properties of Bayesian multiple 
change-point problems when the noise level is assumed known. We proved estimation rate 
of convergence as well as model selection consistency of the posterior distribution. 

The main contribution of the paper is to show that the exact parametric rate is achieved 
for a restricted class of piece-wise constant functions and that this optimal rate cannot be 
achieved uniformly over the class. 

Our theory still leaves some gaps. For example, it is still unknown whether
it is absolutely necessary to restrict the functions to have not too short segment lengths in
order to achieve the optimal rate. The additional restriction makes the local covering number
bounded in order to apply the theorem in Ghosal et al. (2000). Besides, the situation
with unknown error level is of significant practical importance, in which case one should
also put a prior on the noise level. The convergence property in this case is still an open
problem.
Appendix

Some Lemmas 

In preparation for the proofs of the main results, we first collect some lemmas here. Lemma
4 below shows that the local covering number is unbounded as remarked in the main text
and is not used further in any other proofs. The constant $C$ is used to denote a generic
constant which might not be the same at different places. Note that since we are only
considering a uniformly bounded class of functions, the Hellinger distance, the Kullback-Leibler
divergence, as well as the second moment of the likelihood ratio are all equivalent
to the $L_2$ norm of the regression function. In the following, we set
$\delta_0 = \min\{\min_j |t_j^0 - t_{j-1}^0|,\ \min_j |a_j^0 - a_{j-1}^0|\} > 0$, which bounds the segment lengths
as well as the jump size from below.

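Lemma 1. Under condition (A), we have the lower bound for the prior concentration
when $\epsilon_n \to 0$,

$$\Pi(\theta : \|\theta - \theta_0\| \le \epsilon_n) \ge C p(k_0)\,\epsilon_n^{3k_0 - 2}.$$

Proof. When $\theta = \sum_{j=1}^{k_0} a_j I(t_{j-1} \le t < t_j) \in \Theta_{k_0}$ with $|a_j - a_j^0| \le \epsilon_n/2$, $1 \le j \le k_0$, and
$|t_j - t_j^0| \le c\,\epsilon_n^2$, $1 \le j \le k_0 - 1$, for a sufficiently small constant $c$ depending on $K$ and $k_0$,
it is easy to show that $\|\theta - \theta_0\|^2 \le \epsilon_n^2$. Since the prior
density for $(a, t)$ is bounded away from zero, we get

$$\Pi(\theta : \|\theta - \theta_0\| \le \epsilon_n) \ge p(k_0)\,\Pi_{k_0}(\theta : \|\theta - \theta_0\| \le \epsilon_n) \ge C p(k_0)\,\epsilon_n^{3k_0 - 2}.$$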
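
The exponent $3k_0 - 2$ comes from counting free parameters at their respective scales. The display below is a worked version of that count, written under the assumption that the prior density is bounded below by some constant (absorbed into $c'$, a symbol introduced here): each of the $k_0$ levels may vary in an interval of length of order $\epsilon_n$, and each of the $k_0 - 1$ interior change-points in an interval of length of order $\epsilon_n^2$, since moving a change-point by $h$ changes the squared $L_2$ distance by at most $(2K)^2 h$.

```latex
\Pi_{k_0}\bigl(\theta : \|\theta-\theta_0\| \le \epsilon_n\bigr)
  \;\ge\; c' \underbrace{\prod_{j=1}^{k_0} \epsilon_n}_{\text{levels}}
          \;\underbrace{\prod_{j=1}^{k_0-1} \epsilon_n^{2}}_{\text{locations}}
  \;=\; c'\, \epsilon_n^{\,k_0 + 2(k_0-1)}
  \;=\; c'\, \epsilon_n^{\,3k_0-2}.
```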

Lemma 2. Let S' = \J (5o is defined immediately before Lemma\T\j. When e < 5' , 
we have that $\Pi_k(\theta \in \Theta_k : \|\theta - \theta_0\| \le \epsilon) \le C\epsilon^{3k_0 - 2}$, $k = 1, \ldots, k_{max}$, where $C$ is a constant
that depends on $K$, $k_{max}$ and $\delta_0$. For $k > k_0$, the bound can be refined to $C\epsilon^{3k_0 - 1}$.

Proof. First we consider the case $k < k_0$ and $\theta \in \Theta_k$. By the definition of $\delta_0$, the $k_0 - 1$
intervals $(t_j^0 - \delta_0/2,\ t_j^0 + \delta_0/2)$, $j = 1, \ldots, k_0 - 1$, are nonoverlapping. Thus there is at least
one segment of $\theta$ that includes one of these $k_0 - 1$ intervals. Thus the distance between $\theta$
and $\theta_0$ is at least $\sqrt{\delta_0^3/4} > \delta'$, and thus $\Pi_k(\theta \in \Theta_k : \|\theta - \theta_0\| \le \epsilon) = 0$.

When $k \ge k_0$, $\theta \in \Theta_k$ and $\|\theta - \theta_0\| \le \epsilon$, for any $j$, let $s(j)$ be the index of the interval
$[t_{s(j)-1}, t_{s(j)})$ which has the largest overlap with $[t_{j-1}^0, t_j^0)$. Obviously the length of the
overlap is at least $\delta_0/k_{max}$. This implies $|a_{s(j)} - a_j^0| \le \epsilon\sqrt{k_{max}/\delta_0}$ (otherwise the squared $L_2$
distance between $\theta$ and $\theta_0$ is at least $(a_{s(j)} - a_j^0)^2\,\delta_0/k_{max} > \epsilon^2$). Similarly, let $t(j)$ be the
index of the change-point of $\theta$ that is closest to $t_j^0$; we have $|t_{t(j)} - t_j^0| \le 4\epsilon^2/\delta_0^2$ (otherwise the
squared distance will be bigger than $\frac{4\epsilon^2}{\delta_0^2}\left(\frac{\delta_0}{2}\right)^2 = \epsilon^2$). The above considerations give us $k_0$
constraints on the segment levels of $\theta$ as well as $k_0 - 1$ constraints on the change-point
locations. Thus under assumption (A), the prior probability $\Pi_k(\theta \in \Theta_k : \|\theta - \theta_0\| \le \epsilon)$
is bounded by $C\epsilon^{3k_0 - 2}$. For the refined bound, we consider $k = k_0 + 1$ only for simplicity.
Without loss of generality, we assume $t(j) = j$ and thus $\|\theta - \theta_0\| \le \epsilon$ implies an additional
restriction $(a_{k_0}^0 - a_{k_0+1})^2 (1 - t_{k_0}) \le \epsilon^2$. This gives us an additional factor of $\epsilon$ in the bound.

Lemma 3. $\log D(\epsilon, \Theta) \le b \log(1/\epsilon) + c$, for some constants $b, c > 0$ that depend on $K$
and $k_{max}$, where $D(\epsilon, \Theta)$ is the $\epsilon$-covering number of $\Theta$.
Proof. Choose a grid on the domain $[0, 1)$ and another grid on $[-K, K]$:

A* = I ^j^j^ • I, i e n| n [0, 1], A, = {e • j, i e n [-K, K] . 

Let $\tilde\Theta$ be the set of $\theta \in \Theta$ that jump only at points of the first grid and take segment
levels in the second grid. It is then easy to show that $\tilde\Theta$ is an $\epsilon$-covering of $\Theta$ with covering
number bounded by

"'max / ^ 

The next lemma considers the local covering/packing number. In particular, Lemma 4
illustrates why we cannot apply Theorem 2.4 from Ghosal et al. (2000) to obtain the exact
parametric rate on $\Theta$.

Lemma 4.

$$\log D(\epsilon/2, \{\theta \in \Theta : \epsilon \le \|\theta - \theta_0\| \le 2\epsilon\}) \ge \log(C/\epsilon^2)$$

for some constant $C$.

Proof. Without loss of generality, we consider $\theta_0 = 0$. We construct a lower bound for the
packing number. For simplicity, we assume $1/(4\epsilon^2)$ is an integer. Using the partition of
the interval $[0, 1) = \bigcup_i [(i-1)4\epsilon^2,\ i \cdot 4\epsilon^2)$, we construct the piece-wise constant functions
$\theta_i(t) = I((i-1)4\epsilon^2 \le t < i \cdot 4\epsilon^2)$. Obviously, for this set of functions, we have $\|\theta_i\| = 2\epsilon$
and $\|\theta_i - \theta_j\| = 2\sqrt{2}\,\epsilon$ for $i \ne j$. The lower bound for the covering number is obtained by the
simple relationship between covering number and packing number.
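
The packing construction in this proof is easy to check numerically. The sketch below represents each $\theta_i$ by the interval on which it equals 1 and computes the $L_2$ quantities exactly from interval lengths; the helper names are hypothetical and the value $\epsilon = 0.05$ is an arbitrary choice for illustration.

```python
import math


def packing_intervals(eps):
    """Intervals [(i-1) * 4 eps^2, i * 4 eps^2), i = 1..m, with m = 1 / (4 eps^2)."""
    m = round(1.0 / (4.0 * eps ** 2))
    width = 4.0 * eps ** 2
    return [((i - 1) * width, i * width) for i in range(1, m + 1)]


def l2_norm_indicator(interval):
    """L2 norm of the indicator function of an interval."""
    lo, hi = interval
    return math.sqrt(hi - lo)


def l2_dist_indicators(a, b):
    """L2 distance between two indicators: sqrt of the length of the
    symmetric difference of the two intervals."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return math.sqrt((a[1] - a[0]) + (b[1] - b[0]) - 2.0 * overlap)


eps = 0.05
intervals = packing_intervals(eps)
number_of_functions = len(intervals)          # 1 / (4 eps^2)
norm = l2_norm_indicator(intervals[0])        # equals 2 eps up to rounding
gap = l2_dist_indicators(intervals[0], intervals[1])  # equals 2 sqrt(2) eps up to rounding
```

The number of functions grows like $1/\epsilon^2$ while all of them stay in the shell $\epsilon \le \|\theta - \theta_0\| \le 2\epsilon$ and remain more than $\epsilon/2$ apart, which is what makes the local covering number unbounded as $\epsilon \to 0$.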



Lemma 5. For 2e < 6' = 




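$$\log D(\epsilon/2, \{\theta \in \Theta^\delta : \epsilon \le \|\theta - \theta_0\| \le 2\epsilon\}) \le C$$

for some constant $C$ that depends on $\delta$, $\delta_0$, $K$ and $k_{max}$ but does not depend on $\epsilon$.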

Proof. Suppose that $\|\theta - \theta_0\| \le 2\epsilon$. From the proof of Lemma 2, we know that each change-point
of $\theta_0$ has a corresponding change-point of $\theta$ that satisfies $|t_{t(j)} - t_j^0| \le 16\epsilon^2/\delta_0^2$. For
any segment level of $\theta_0$, denote the corresponding index of the segment of $\theta$ that has an
overlap of at least $\delta/2$ by $r(j)$; by a similar argument as in Lemma 2, $|a_j^0 - a_{r(j)}| \le 2\sqrt{2}\,\epsilon/\sqrt{\delta}$.

To construct a covering, we partition $[0, 1)$ into nonoverlapping intervals. In the following,
$M$, $B$, $N$ are sufficiently large integers to be chosen later. First, each interval
$[t_j^0 - 16\epsilon^2/\delta_0^2,\ t_j^0 + 16\epsilon^2/\delta_0^2]$ is partitioned into $M$ subintervals with equal lengths. For the
rest of $[0, 1)$ we partition it into segments of lengths between $\delta/2B$ and $\delta/B$. Obviously the
total number of subintervals does not depend on $\epsilon$. These subintervals fall into two types:
(i) the subinterval that contains some change-point of $\theta_0$; (ii) the subinterval that is entirely
contained in some segment of $\theta_0$. The function class $F$ that forms a covering is defined as the
set of functions which are piece-wise constant with respect to the partition, take the value
0 on type (i) subintervals, and take values on a grid of $2N + 1$ equally spaced points centered
at $a_j^0$, indexed by $i = -N, -(N-1), \ldots, N$,
on type (ii) subintervals if the subinterval is contained in segment $j$ of $\theta_0$. The size of $F$ is
a constant independent of $\epsilon$ and we show next that it is indeed an $\epsilon/2$-covering.
On subintervals of type (i), the squared $L_2$ distance between $F$ and $\theta$ restricted on these
intervals arc at most ^^k^axK^ ■ Type (ii) subintervals can further be divided into three 
types: (iii) it contains a change-point of 9 which is closest to some change-point of Oq] (iv) 
it contains a change-point of 9 other than those closest to some change-point of 60; (v) 
it is entirely contained in some segment of 9. On subintervals of type (iii) the squared 
distance is at most ^^kmaxK"^ ■ On subintervals of type (iv) the squared distance is at 
most -gfcmaa:( ^^"^ )^- On subiutcrvals of type (v) the squared distance is at most (f^^)^- 
Thus when $B$, $N$ are large enough, we have an $\epsilon/2$-covering.



Proofs of the main results 

Proof of Theorem 1. We apply Theorem 2.1 in Ghosal et al. (2000) with $\epsilon_n = C\sqrt{\log n/n}$.
Condition (2.2) for that theorem is verified in Lemma 3, condition (2.3) is trivially satisfied
and condition (2.4) is verified in Lemma 1.

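Proof of Theorem 2. Theorem 1 immediately implies that the under-estimation probability
$\Pi^n(k < k_0) \to 0$ in $P_0^n$ probability. For over-estimation, it is sufficient to show that
$P_0^n\left(\int_{\Theta_{k_0}} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\Pi_{k_0}(\theta) \le C n^{-(3k_0 - 2 + 2\xi)/2}\right) \to 0$ for some $0 < \xi < 1/2$, and
$P_0^n\left(\int_{\Theta_k} \frac{p_\theta^n(Z)}{p_0^n(Z)}\, d\Pi_k(\theta) \ge (\log n)^{-1} n^{-(3k_0 - 2 + 2\xi)/2}\right) \to 0$ when $k > k_0$.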

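Step 1. Let $U_n = \{t \in T_{k_0} : t = t^0 + u,\ u \in \mathbb{R}^{k_0+1},\ u_0 = u_{k_0} = 0,\ |u_j| \le c/n\}$ with
$\Pi^t_{k_0}(U_n) \ge c' n^{-(k_0 - 1)}$, where $\Pi^t_{k_0}$ is the prior measure on the locations of change-points. For
any fixed $t \in U_n$, with probability converging to 1, by considering a small neighborhood
of the maximum likelihood estimator $\hat a(t)$ for the given $t$ as in Laplace approximation, we
have

$$\int \frac{p^n_{(a,t)}(Z)}{p^n_0(Z)}\, \pi^a_{k_0}(a \mid t)\, da \ \ge\ \frac{C}{n^{k_0/2}} \frac{p^n_{(\hat a(t), t)}(Z)}{p^n_0(Z)} \ \ge\ \frac{C}{n^{k_0/2}} \frac{p^n_{(a^0, t)}(Z)}{p^n_0(Z)}.$$

For any $t \in U_n$, and conditional on $\{X_i\}$, $\log \frac{p^n_{(a^0,t)}(Z)}{p^n_0(Z)}$ is normally distributed with mean
$-\frac{1}{2} f(t)^2$ and variance $f(t)^2$, where $f(t)^2 = \sum_j (a^0_{j+1} - a^0_j)^2\, n_j$ and $n_j$ is the number of
$X_i$ that fall into the subinterval $[t_j^0, t_j)$ or $[t_j, t_j^0)$ (depending on the sign of $u_j$). Since $n_j$
is Binomial distributed with mean less than $c$, $f(t)^2 = O_p(\xi \log n)$ and thus
$\log \frac{p^n_{(a^0,t)}(Z)}{p^n_0(Z)} \ge -\xi \log n$ with probability converging to 1. Thus, with probability converging to 1, we have
$\int_{\Theta_{k_0}} \frac{p^n_\theta(Z)}{p^n_0(Z)}\, d\Pi_{k_0}(\theta) \ge \Pi^t_{k_0}(U_n)\, \frac{C}{n^{k_0/2}}\, n^{-\xi} = C n^{-(3k_0 - 2 + 2\xi)/2}$.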

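Step 2. Letting $\delta_n = \left(2 \log n \cdot n^{(3k_0 - 2(1 - \xi))/2}\right)^{-1}$ and $\epsilon_n = C \log n/\sqrt{n}$, we have that

$$P_0^n\left(\int_{\Theta_k} \frac{p^n_\theta(Z)}{p^n_0(Z)}\, d\Pi_k(\theta) \ge (\log n)^{-1} n^{-(3k_0 - 2 + 2\xi)/2}\right)$$
$$\le P_0^n\left(\int_{\{\|\theta - \theta_0\| \le \epsilon_n\} \cap \Theta_k} \frac{p^n_\theta(Z)}{p^n_0(Z)}\, d\Pi_k(\theta) \ge \delta_n\right) + P_0^n\left(\int_{\{\|\theta - \theta_0\| > \epsilon_n\} \cap \Theta_k} \frac{p^n_\theta(Z)}{p^n_0(Z)}\, d\Pi_k(\theta) \ge \delta_n\right).$$

By the Markov inequality and Fubini's theorem, the first term above is bounded by
$\frac{1}{\delta_n} \Pi_k(\|\theta - \theta_0\| \le \epsilon_n) \le \frac{C}{\delta_n}\, \epsilon_n^{3k_0 - 1} \to 0$, where we have made use of Lemma 2.

For the second term, we apply Theorem 1 of Shen and Wasserman (2001) with $\epsilon$ in that
theorem replaced by $\epsilon_n$ defined above. Using

$$\int_{\epsilon^2/2^8}^{\sqrt{2}\epsilon} \sqrt{b \log(1/u) + c}\ du \ \le\ \sqrt{b \log\frac{2^8}{\epsilon^2} + c} \cdot \sqrt{2}\,\epsilon,$$

the entropy condition in that theorem can be verified for $\epsilon = \epsilon_n$. Thus when $C$ is large
enough, the second term also converges to 0.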

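Proof of Theorem 3. We apply Theorem 2.4 in Ghosal et al. (2000) using $\epsilon_n = A/\sqrt{n}$
with $A$ sufficiently large. Condition (2.7) for that theorem is verified in Lemma 5 and (2.8)
is trivially satisfied. Now we verify (2.9), for which we need to bound

$$\frac{\Pi(\theta : j\epsilon_n < \|\theta - \theta_0\| \le 2j\epsilon_n)}{\Pi(\theta : \|\theta - \theta_0\| \le \epsilon_n)}.$$

When $j < \delta'\sqrt{n}/(2A)$, $2j\epsilon_n < \delta'$ and Lemma 2 can be directly applied to obtain that
$\Pi_k(\theta \in \Theta_k^\delta : \|\theta - \theta_0\| \le 2j\epsilon_n) \le C(j\epsilon_n)^{3k_0 - 2}$; combined with the lower bound of Lemma 1,
the ratio is at most $C j^{3k_0 - 2} \le C \exp(A^2 j^2/2)$. For $j \ge \delta'\sqrt{n}/(2A)$, we bound the probability by 1,
and the ratio is at most $C(1/\epsilon_n)^{3k_0 - 2} \le C \exp(A^2 j^2/2)$ for this range of $j$.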

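Proof of Corollary 1. By Theorem 2, we can assume the number of change-points of $\theta$
is also $k_0$. Then $\max_{1 \le j \le k_0} |t_j - t_j^0| > M_n\epsilon_n^2$ implies that $\|\theta - \theta_0\|^2 > (\delta_0/2)^2 M_n\epsilon_n^2$. Thus
$\Pi^n(\max_{1 \le j \le k_0} |t_j - t_j^0| > M_n\epsilon_n^2) \le \Pi^n(\theta \in \Theta^\delta : \|\theta - \theta_0\| > (\delta_0/2)\sqrt{M_n}\,\epsilon_n) \to 0$.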

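Proof of Theorem 4. Fixing one $t \in U_n = \{t \in T_{k_0} : \max_j |t_j - t_j^0| \le C/n\}$, denote
the maximum likelihood estimator for $a$ by $\hat a(t)$. Let $\Pi^n_{a|t,Z}$ and $\Pi^n_{t|Z}$ be the posterior
measure for $a$ conditioning on $t$ and the posterior measure for $t$ respectively. The classical
Bernstein-von Mises Theorem implies that $E_0 \|\Pi^n_{a|t,Z} - N(\hat a(t), I_0)\|_{TV} \to 0$. We have that

$$E_0 \|\Pi^n_{a|Z} - N(\hat a(t^0), I_0)\|_{TV} \le E_0 \Big\|\int_{U_n} \big(\Pi^n_{a|t,Z} - N(\hat a(t^0), I_0)\big)\, d\Pi^n_{t|Z}\Big\|_{TV} + E_0 \Big\|\int_{U_n^c} \big(\Pi^n_{a|t,Z} - N(\hat a(t^0), I_0)\big)\, d\Pi^n_{t|Z}\Big\|_{TV} = (I) + (II).$$

(I) can be bounded by

$$E_0 \int_{U_n} \|\Pi^n_{a|t,Z} - N(\hat a(t), I_0)\|_{TV}\, d\Pi^n_{t|Z} + E_0 \int_{U_n} \|N(\hat a(t), I_0) - N(\hat a(t^0), I_0)\|_{TV}\, d\Pi^n_{t|Z}.$$

The first term converges to zero by the boundedness of the TV norm and Fubini's
theorem. The second term converges to zero since $\|\hat a(t) - \hat a(t^0)\| = o_p(1/\sqrt{n})$. Term (II) is at
most a constant multiple of $E_0 \Pi^n_{t|Z}(U_n^c)$, which is controlled by Corollary 1. Letting $n$
go to infinity and then $C$ go to infinity, we see that $E_0 \|\Pi^n_{a|Z} - N(\hat a(t^0), I_0)\|_{TV} \to 0$.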
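Proof of Theorem 5. Fix any number $M > 0$ and $\gamma > 2M$. Define $\theta_0 = 0$ and
$\theta_n = \frac{\gamma}{\sqrt{n}} I(\frac{1}{2} \le t < 1)$, a function with a single change-point and jump size $\gamma/\sqrt{n}$. We trivially have
$\|\theta - \theta_n\| \ge \frac{\gamma}{2\sqrt{n}} > \frac{M}{\sqrt{n}}$ for all $\theta \in \Theta_1$ (i.e. $\theta$ is a constant function). Under $\theta_0$, the posterior
probability on $\Theta_1$ converges to 1 by Theorem 2. This gives us

$$E_{Z|\theta_0} \Pi^n(\theta : \|\theta - \theta_n\| > M/\sqrt{n}) \ge E_{Z|\theta_0} \Pi^n(\theta : \|\theta - \theta_n\| > M/\sqrt{n},\ \theta \in \Theta_1) \ge E_{Z|\theta_0} \Pi^n(\Theta_1) \to 1.$$

Since the measure $P^n_{\theta_0}$ induced by $\theta_0$ and the measure $P^n_{\theta_n}$ induced by $\theta_n$ are mutually
contiguous (this is a straightforward extension of Theorem 7.2 in Van der Vaart (1998)), we have

$$\sup_{\theta^* \in \Theta^\delta} E_{Z|\theta^*} \Pi^n(\theta \in \Theta^\delta : \|\theta - \theta^*\| > M\epsilon_n) \ge E_{Z|\theta_n} \Pi^n(\theta \in \Theta^\delta : \|\theta - \theta_n\| > M\epsilon_n) \to 1.$$

Since this is true for any $M$, it is also true for some slowly diverging sequence $M_n$ as in the
statement of the theorem.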
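
The lower bound on the distance from $\theta_n$ to the constant functions, used at the start of this proof, can be checked directly. The display below is a worked version of that step; the symbol $c$ for a generic constant function is introduced here for illustration only.

```latex
\min_{c \in \mathbb{R}} \|c - \theta_n\|^2
  = \min_{c \in \mathbb{R}} \Bigl[ \tfrac12\, c^2 + \tfrac12 \bigl(c - \tfrac{\gamma}{\sqrt{n}}\bigr)^2 \Bigr]
  = \tfrac12 \Bigl(\tfrac{\gamma}{2\sqrt{n}}\Bigr)^2 + \tfrac12 \Bigl(\tfrac{\gamma}{2\sqrt{n}}\Bigr)^2
  = \frac{\gamma^2}{4n},
\qquad \text{attained at } c = \frac{\gamma}{2\sqrt{n}}.
```

Taking square roots gives $\|\theta - \theta_n\| \ge \gamma/(2\sqrt{n})$ for every constant $\theta$, and this exceeds $M/\sqrt{n}$ exactly because $\gamma > 2M$.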